{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0b692c73",
   "metadata": {},
   "source": [
    "# Cassandra Vector Store"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1e7787c2",
   "metadata": {},
   "source": [
    "[Apache Cassandra®](https://cassandra.apache.org) is a NoSQL, row-oriented, highly scalable and highly available database.\n",
    "Newest Cassandra releases natively [support](https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor(ANN)+Vector+Search+via+Storage-Attached+Indexes) Vector Similarity Search.\n",
    "\n",
    "**This notebook shows the basic usage of Cassandra as a Vector Store in LlamaIndex.**\n",
    "\n",
    "To run this notebook you need either a running Cassandra cluster equipped with Vector Search capabilities (in pre-release at the time of writing) or a DataStax Astra DB instance running in the cloud (you can get one for free at [datastax.com](https://astra.datastax.com)). _This notebook covers both choices._ Check [cassio.org](https://cassio.org/start_here/) for more information, quickstarts and tutorials."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "daff81fe",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "e9f8dbcb",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install \"cassio>=0.1.0\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "47264e32",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-02-10T12:20:23.988789Z",
     "start_time": "2023-02-10T12:20:23.967877Z"
    }
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "from llama_index import (\n",
    "    VectorStoreIndex,\n",
    "    SimpleDirectoryReader,\n",
    "    Document,\n",
    "    StorageContext,\n",
    ")\n",
    "from llama_index.vector_stores import CassandraVectorStore"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c692310",
   "metadata": {},
   "source": [
    "### Please provide database connection parameters and secrets\n",
    "\n",
    "First you need a database connection (a `cassandra.cluster.Session` object).\n",
    "\n",
    "Make sure you have either a vector-capable running Cassandra cluster or an [Astra DB](https://astra.datastax.com) instance in the cloud."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "ba118688",
   "metadata": {},
   "outputs": [
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      "\n",
      "(C)assandra or (A)stra DB?  A\n",
      "\n",
      "Keyspace name?  llama_index_demo\n",
      "\n",
      "Astra DB Token (\"AstraCS:...\")  ········\n",
      "Full path to your Secure Connect Bundle?  /path/to/secure-connect-DATABASE.zip\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "import getpass\n",
    "\n",
    "database_mode = (input(\"\\n(C)assandra or (A)stra DB? \")).upper()\n",
    "\n",
    "keyspace_name = input(\"\\nKeyspace name? \")\n",
    "\n",
    "if database_mode == \"A\":\n",
    "    ASTRA_DB_APPLICATION_TOKEN = getpass.getpass('\\nAstra DB Token (\"AstraCS:...\") ')\n",
    "    #\n",
    "    ASTRA_DB_SECURE_BUNDLE_PATH = input(\"Full path to your Secure Connect Bundle? \")\n",
    "elif database_mode == \"C\":\n",
    "    CASSANDRA_CONTACT_POINTS = input(\n",
    "        \"Contact points? (comma-separated, empty for localhost) \"\n",
    "    ).strip()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "20933a07",
   "metadata": {},
   "outputs": [],
   "source": [
    "from cassandra.cluster import Cluster\n",
    "from cassandra.auth import PlainTextAuthProvider\n",
    "\n",
    "if database_mode == \"C\":\n",
    "    if CASSANDRA_CONTACT_POINTS:\n",
    "        cluster = Cluster(\n",
    "            [cp.strip() for cp in CASSANDRA_CONTACT_POINTS.split(\",\") if cp.strip()]\n",
    "        )\n",
    "    else:\n",
    "        cluster = Cluster()\n",
    "    session = cluster.connect()\n",
    "elif database_mode == \"A\":\n",
    "    ASTRA_DB_CLIENT_ID = \"token\"\n",
    "    cluster = Cluster(\n",
    "        cloud={\n",
    "            \"secure_connect_bundle\": ASTRA_DB_SECURE_BUNDLE_PATH,\n",
    "        },\n",
    "        auth_provider=PlainTextAuthProvider(\n",
    "            ASTRA_DB_CLIENT_ID,\n",
    "            ASTRA_DB_APPLICATION_TOKEN,\n",
    "        ),\n",
    "    )\n",
    "    session = cluster.connect()\n",
    "else:\n",
    "    raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f9b97a89",
   "metadata": {},
   "source": [
    "### Please provide OpenAI access key\n",
    "\n",
    "In order use embeddings by OpenAI you need to supply an OpenAI API Key:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "0c9f4d21-145a-401e-95ff-ccb259e8ef84",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-02-10T12:20:24.908956Z",
     "start_time": "2023-02-10T12:20:24.537064Z"
    },
    "pycharm": {
     "is_executing": true
    }
   },
   "outputs": [
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      "OpenAI API Key: ········\n"
     ]
    }
   ],
   "source": [
    "import openai\n",
    "\n",
    "OPENAI_API_KEY = getpass.getpass(\"OpenAI API Key:\")\n",
    "openai.api_key = OPENAI_API_KEY"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "59ff935d",
   "metadata": {},
   "source": [
    "## Creating and populating the Vector Store\n",
    "\n",
    "You will now load some essays by Paul Graham from a local file and store them into the Cassandra Vector Store."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "68cbd239-880e-41a3-98d8-dbb3fab55431",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-02-10T12:20:30.175678Z",
     "start_time": "2023-02-10T12:20:30.172456Z"
    },
    "pycharm": {
     "is_executing": true
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total documents: 1\n",
      "First document, id: 5b7489b6-0cca-4088-8f30-6de32d540fdf\n",
      "First document, hash: 4c702b4df575421e1d1af4b1fd50511b226e0c9863dbfffeccb8b689b8448f35\n",
      "First document, text (75019 characters):\n",
      "====================\n",
      "\t\t\n",
      "\n",
      "What I Worked On\n",
      "\n",
      "February 2021\n",
      "\n",
      "Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined  ...\n"
     ]
    }
   ],
   "source": [
    "# load documents\n",
    "documents = SimpleDirectoryReader(\"../data/paul_graham\").load_data()\n",
    "print(f\"Total documents: {len(documents)}\")\n",
    "print(f\"First document, id: {documents[0].doc_id}\")\n",
    "print(f\"First document, hash: {documents[0].hash}\")\n",
    "print(\n",
    "    f\"First document, text ({len(documents[0].text)} characters):\\n{'='*20}\\n{documents[0].text[:360]} ...\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd270925",
   "metadata": {},
   "source": [
    "### Initialize the Cassandra Vector Store\n",
    "\n",
    "Creation of the vector store entails creation of the underlying database table if it does not exist yet:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "afc5c44f",
   "metadata": {},
   "outputs": [],
   "source": [
    "cassandra_store = CassandraVectorStore(\n",
    "    session=session,\n",
    "    keyspace=keyspace_name,\n",
    "    table=\"cassandra_vector_table_1\",\n",
    "    embedding_dimension=1536,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cbabd1a7",
   "metadata": {},
   "source": [
    "Now wrap this store into an `index` LlamaIndex abstraction for later querying:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "ca205b88",
   "metadata": {},
   "outputs": [],
   "source": [
    "storage_context = StorageContext.from_defaults(vector_store=cassandra_store)\n",
    "\n",
    "index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cb11e2e2",
   "metadata": {},
   "source": [
    "Note that the above `from_documents` call does several things at once: it splits the input documents into chunks of manageable size (\"nodes\"), computes embedding vectors for each node, and stores them all in the Cassandra Vector Store."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "04304299-fc3e-40a0-8600-f50c3292767e",
   "metadata": {},
   "source": [
    "## Querying the store"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b241797e",
   "metadata": {},
   "source": [
    "### Basic querying"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "35369eda",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-02-10T12:20:51.328762Z",
     "start_time": "2023-02-10T12:20:33.822688Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "The author chose to work on AI because of his fascination with the novel The Moon is a Harsh Mistress, which featured an intelligent computer called Mike, and a PBS documentary that showed Terry Winograd using SHRDLU. He was also drawn to the idea that AI could be used to explore the ultimate truths that other fields could not.\n"
     ]
    }
   ],
   "source": [
    "query_engine = index.as_query_engine()\n",
    "response = query_engine.query(\"Why did the author choose to work on AI?\")\n",
    "print(response.response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48761020",
   "metadata": {},
   "source": [
    "### MMR-based queries\n",
    "\n",
    "The MMR (maximal marginal relevance) method is designed to fetch text chunks from the store that are at the same time relevant to the query but as different as possible from each other, with the goal of providing a broader context to the building of the final answer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "eb2054c0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "The author chose to work on AI because he was impressed and envious of his friend who had built a computer kit and was able to type programs into it. He was also inspired by a novel by Heinlein called The Moon is a Harsh Mistress, which featured an intelligent computer called Mike, and a PBS documentary that showed Terry Winograd using SHRDLU. He was also disappointed with philosophy courses in college, which he found to be boring, and he wanted to work on something that seemed more powerful.\n"
     ]
    }
   ],
   "source": [
    "query_engine = index.as_query_engine(vector_store_query_mode=\"mmr\")\n",
    "response = query_engine.query(\"Why did the author choose to work on AI?\")\n",
    "print(response.response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4d7bc976",
   "metadata": {},
   "source": [
    "## Connecting to an existing store\n",
    "\n",
    "Since this store is backed by Cassandra, it is persistent by definition. So, if you want to connect to a store that was created and populated previously, here is how:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "f0aae26e",
   "metadata": {},
   "outputs": [],
   "source": [
    "new_store_instance = CassandraVectorStore(\n",
    "    session=session,\n",
    "    keyspace=keyspace_name,\n",
    "    table=\"cassandra_vector_table_1\",\n",
    "    embedding_dimension=1536,\n",
    ")\n",
    "\n",
    "# Create index (from preexisting stored vectors)\n",
    "new_index_instance = VectorStoreIndex.from_vector_store(vector_store=new_store_instance)\n",
    "\n",
    "# now you can do querying, etc:\n",
    "query_engine = index.as_query_engine(similarity_top_k=5)\n",
    "response = query_engine.query(\"What did the author study prior to working on AI?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "1ceec3bf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "The author studied philosophy and painting, worked on spam filters, and wrote essays prior to working on AI.\n"
     ]
    }
   ],
   "source": [
    "print(response.response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "52b975a7",
   "metadata": {},
   "source": [
    "## Removing documents from the index\n",
    "\n",
    "First get an explicit list of pieces of a document, or \"nodes\", from a `Retriever` spawned from the index:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "75ed7807",
   "metadata": {},
   "outputs": [],
   "source": [
    "retriever = new_index_instance.as_retriever(\n",
    "    vector_store_query_mode=\"mmr\",\n",
    "    similarity_top_k=3,\n",
    "    vector_store_kwargs={\"mmr_prefetch_factor\": 4},\n",
    ")\n",
    "nodes_with_scores = retriever.retrieve(\n",
    "    \"What did the author study prior to working on AI?\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "6ae9c6b1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found 3 nodes.\n",
      "    [0] score = 0.42589144520149874\n",
      "        id    = 05f53f06-9905-461a-bc6d-fa4817e5a776\n",
      "        text  = What I Worked On\n",
      "\n",
      "February 2021\n",
      "\n",
      "Before college the two main things I worked on, outside o ...\n",
      "    [1] score = -0.0012061281453193962\n",
      "        id    = 2f9f843e-6495-4646-a03d-4b844ff7c1ab\n",
      "        text  = been explored. But all I wanted was to get out of grad school, and my rapidly written diss ...\n",
      "    [2] score = 0.025454533089838027\n",
      "        id    = 28ad32da-25f9-4aaa-8487-88390ec13348\n",
      "        text  = showed Terry Winograd using SHRDLU. I haven't tried rereading The Moon is a Harsh Mistress ...\n"
     ]
    }
   ],
   "source": [
    "print(f\"Found {len(nodes_with_scores)} nodes.\")\n",
    "for idx, node_with_score in enumerate(nodes_with_scores):\n",
    "    print(f\"    [{idx}] score = {node_with_score.score}\")\n",
    "    print(f\"        id    = {node_with_score.node.node_id}\")\n",
    "    print(f\"        text  = {node_with_score.node.text[:90]} ...\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4bdc78f4-36ee-4358-9050-98f6e6652092",
   "metadata": {},
   "source": [
    "But wait! When using the vector store, you should consider the **document** as the sensible unit to delete, and not any individual node belonging to it. Well, in this case, you just inserted a single text file, so all nodes will have the same `ref_doc_id`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "c52bb601-0838-46bb-8b9f-f2012a1c3f84",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Nodes' ref_doc_id:\n",
      "5b7489b6-0cca-4088-8f30-6de32d540fdf\n",
      "5b7489b6-0cca-4088-8f30-6de32d540fdf\n",
      "5b7489b6-0cca-4088-8f30-6de32d540fdf\n"
     ]
    }
   ],
   "source": [
    "print(\"Nodes' ref_doc_id:\")\n",
    "print(\"\\n\".join([nws.node.ref_doc_id for nws in nodes_with_scores]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7659d4c3",
   "metadata": {},
   "source": [
    "Now let's say you need to remove the text file you uploaded:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "2c7aafa0",
   "metadata": {},
   "outputs": [],
   "source": [
    "new_store_instance.delete(nodes_with_scores[0].node.ref_doc_id)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "357a624b",
   "metadata": {},
   "source": [
    "Repeat the very same query and check the results now. You should see _no results_ being found:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "813276ff",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found 0 nodes.\n"
     ]
    }
   ],
   "source": [
    "nodes_with_scores = retriever.retrieve(\n",
    "    \"What did the author study prior to working on AI?\"\n",
    ")\n",
    "\n",
    "print(f\"Found {len(nodes_with_scores)} nodes.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9fa59402",
   "metadata": {},
   "source": [
    "## Metadata filtering\n",
    "\n",
    "The Cassandra vector store support metadata filtering in the form of exact-match `key=value` pairs at query time. The following cells, which work on a brand new Cassandra table, demonstrate this feature.\n",
    "\n",
    "In this demo, for the sake of brevity, a single source document is loaded (the `../data/paul_graham/paul_graham_essay.txt` text file). Nevertheless, you will attach some custom metadata to the document to illustrate how you can can restrict queries with conditions on the metadata attached to the documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "23c6ff1a",
   "metadata": {},
   "outputs": [],
   "source": [
    "md_storage_context = StorageContext.from_defaults(\n",
    "    vector_store=CassandraVectorStore(\n",
    "        session=session,\n",
    "        keyspace=keyspace_name,\n",
    "        table=\"cassandra_vector_table_2_md\",\n",
    "        embedding_dimension=1536,\n",
    "    )\n",
    ")\n",
    "\n",
    "\n",
    "def my_file_metadata(file_name: str):\n",
    "    \"\"\"Depending on the input file name, associate a different metadata.\"\"\"\n",
    "    if \"essay\" in file_name:\n",
    "        source_type = \"essay\"\n",
    "    elif \"dinosaur\" in file_name:\n",
    "        # this (unfortunately) will not happen in this demo\n",
    "        source_type = \"dinos\"\n",
    "    else:\n",
    "        source_type = \"other\"\n",
    "    return {\"source_type\": source_type}\n",
    "\n",
    "\n",
    "# Load documents and build index\n",
    "md_documents = SimpleDirectoryReader(\n",
    "    \"../data/paul_graham\", file_metadata=my_file_metadata\n",
    ").load_data()\n",
    "md_index = VectorStoreIndex.from_documents(\n",
    "    md_documents, storage_context=md_storage_context\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4d48250",
   "metadata": {},
   "source": [
    "That's it: you can now add filtering to your query engine:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "d4bfd6f6",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "733467f3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "It took the author five weeks to write his thesis.\n"
     ]
    }
   ],
   "source": [
    "md_query_engine = md_index.as_query_engine(\n",
    "    filters=MetadataFilters(\n",
    "        filters=[ExactMatchFilter(key=\"source_type\", value=\"essay\")]\n",
    "    )\n",
    ")\n",
    "md_response = md_query_engine.query(\"How long it took the author to write his thesis?\")\n",
    "print(md_response.response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4847dc97",
   "metadata": {},
   "source": [
    "To test that the filtering is at play, try to change it to use only `\"dinos\"` documents... there will be no answer this time :)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
